Cost Functions

Cost functions

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i)$$

where L is the loss function.

Important

Cost vs. Loss:
the loss applies to a single training sample; the cost is the average of the loss over the whole training set.
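To make the distinction concrete, a minimal NumPy sketch (function and variable names are my own, using squared error as the example loss):

```python
import numpy as np

def squared_loss(y_hat, y):
    """Loss: measured per sample (computed elementwise here)."""
    return (y_hat - y) ** 2

def cost(y_hat, y):
    """Cost: the mean of the per-sample losses."""
    return np.mean(squared_loss(y_hat, y))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(squared_loss(y_pred, y_true))  # per-sample losses: [0.25, 0.0, 1.0]
print(cost(y_pred, y_true))          # their mean: ~0.4167
```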

Forms of cost functions

Note that the error can be defined in different ways:

$$\begin{aligned}
\text{Mean Squared Error} &= (x - \hat{x})^2 \\
\text{Absolute Error} &= |x - \hat{x}| \\
\text{Zero-One Loss} &= \begin{cases} 0, & \text{if } x = \hat{x} \\ 1, & \text{otherwise} \end{cases}
\end{aligned}$$
Important

  • Find more types of error in Error Metrics.
  • In ML, Mean Squared Error is commonly used as the cost function, but with an extra division by 2, which "is just meant to make later partial derivation in gradient descent neater":

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{x}_i - x_i)^2$$
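A short sketch of why the extra 1/2 is convenient (names are my own): in the gradient, the 1/2 cancels the exponent 2.

```python
import numpy as np

def mse_cost(x_hat, x):
    """MSE cost with the extra 1/2 factor: sum((x_hat - x)^2) / (2m)."""
    m = len(x)
    return np.sum((x_hat - x) ** 2) / (2 * m)

def mse_cost_gradient(x_hat, x):
    """Derivative w.r.t. x_hat: the 1/2 cancels the exponent 2."""
    m = len(x)
    return (x_hat - x) / m

x_hat = np.array([2.0, 4.0])
x = np.array([1.0, 2.0])
print(mse_cost(x_hat, x))           # (1 + 4) / (2 * 2) = 1.25
print(mse_cost_gradient(x_hat, x))  # [0.5, 1.0]
```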

Cost function with regularization

When you apply Regularization, a regularization term is added to the cost function to penalize large parameters and avoid overfitting.

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{x}_i - x_i)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( -y_i \log(f(x_i)) - (1 - y_i)\log(1 - f(x_i)) \right) + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

where j represents the jth feature.
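A sketch of the L2-regularized MSE cost (names are my own; by the usual convention the bias term is excluded from the penalty, matching the sum over the n features):

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """MSE cost plus L2 penalty lam/(2m) * sum(theta_j^2).
    theta[0] is treated as the bias and excluded from the penalty."""
    m = len(y)
    residual = X @ theta - y
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return np.sum(residual ** 2) / (2 * m) + penalty

X = np.array([[1.0, 1.0], [1.0, 2.0]])  # first column is the bias feature
theta = np.array([0.0, 1.0])
y = np.array([1.0, 2.0])
print(ridge_cost(theta, X, y, lam=2.0))  # residual is 0, so cost = penalty = 0.5
```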

The Hundred-Page Machine Learning Book

  • In practice, L1 regularization produces a sparse model, a model that has most of its parameters equal to zero, provided the hyperparameter C is large enough. So L1 performs feature selection by deciding which features are essential for prediction and which are not. That can be useful in case you want to increase model explainability.
  • However, if your only goal is to maximize the performance of the model on the holdout data, then L2 usually gives better results. L2 also has the advantage of being differentiable, so gradient descent can be used for optimizing the objective function.

Loss and cost for different functions

Loss and cost for linear regression -> Analytic solution

In matrix form, with $\hat{y} = Xw + b$ (the bias $b$ can be absorbed into $w$ by appending a column of ones to $X$), the loss function is

$$\|y - Xw\|^2$$

Setting the derivative of the loss to 0 to achieve the minimum of the loss:

$$\nabla_w \|y - Xw\|^2 = 2X^T(Xw - y) = 0$$

the solution is

$$w = (X^T X)^{-1} X^T y$$

The solution will only be unique when the matrix XTX is invertible, i.e., when the columns of X are linearly independent.
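A quick NumPy check of the analytic solution on synthetic noiseless data (all names and values are illustrative); solving the linear system is numerically safer than explicitly forming the inverse:

```python
import numpy as np

# Synthetic data: random features, known weights, no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

# Analytic solution w = (X^T X)^{-1} X^T y, via np.linalg.solve.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # recovers [2.0, -1.0, 0.5] up to floating-point error
```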

Loss and cost for logistic regression

MSE is not suitable here because, combined with the sigmoid output, the resulting cost function would not be convex.

loss function $L(f(x_i), y_i)$:

$$L = \begin{cases} -\log(f(x_i)), & \text{if } y_i = 1 \\ -\log(1 - f(x_i)), & \text{if } y_i = 0 \end{cases}$$

Combining the two cases above, we get the simplified loss function for logistic regression:

$$L = -y_i \log(f(x_i)) - (1 - y_i)\log(1 - f(x_i))$$

then the cost function with full form (also used in Maximum likelihood estimation for logistic regression):

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( -y_i \log(f(x_i)) - (1 - y_i)\log(1 - f(x_i)) \right)$$
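A minimal NumPy sketch of this cost (names are my own; clipping the probabilities is a practical guard against `log(0)`, not part of the formula):

```python
import numpy as np

def logistic_cost(f, y, eps=1e-12):
    """Average logistic (cross-entropy) loss.
    f: predicted probabilities in (0, 1); y: labels in {0, 1}.
    Clipping to [eps, 1 - eps] avoids log(0) for saturated predictions."""
    f = np.clip(f, eps, 1 - eps)
    return np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f))

# Two confident, correct predictions -> small cost (-log 0.9 on each sample).
print(logistic_cost(np.array([0.9, 0.1]), np.array([1, 0])))  # ~0.105
```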

Loss and cost for Softmax

What is Softmax: Artificial Neural Networks#^6ef895

The loss function associated with Softmax, the cross-entropy loss, is:

$$L(\mathbf{a}, y) = \begin{cases} -\log(a_1), & \text{if } y = 1 \\ \quad\vdots \\ -\log(a_N), & \text{if } y = N \end{cases} \tag{3}$$

Only the term that corresponds to the target class contributes to the loss; the others are zero:
$$\mathbf{1}\{y == n\} = \begin{cases}
1, & \text{if } y == n \\
0, & \text{otherwise}
\end{cases}$$
Cost function:

$$J(\mathbf{w}, b) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{N} \mathbf{1}\{y^{(i)} == j\} \log \frac{e^{z_j^{(i)}}}{\sum_{k=1}^{N} e^{z_k^{(i)}}} \right] \tag{4}$$

where m is the number of examples, N is the number of outputs. This is the average of all the losses.
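A sketch of this cost computed directly from logits (names are my own; labels are 0-indexed here, unlike the 1..N indexing in the formula):

```python
import numpy as np

def softmax_cross_entropy_cost(Z, y):
    """Average softmax cross-entropy cost.
    Z: (m, N) logits z; y: (m,) integer class labels in 0..N-1.
    Subtracting the row-wise max keeps exp() numerically stable."""
    Z = Z - Z.max(axis=1, keepdims=True)
    log_probs = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    m = len(y)
    # Only the log-probability of the target class contributes per sample.
    return -np.mean(log_probs[np.arange(m), y])

# Uniform logits over 2 classes -> cost = log(2) (maximum uncertainty).
print(softmax_cross_entropy_cost(np.array([[0.0, 0.0]]), np.array([0])))
```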

Important

Cross-entropy takes the full distribution into account.

Expected loss function

A posterior distribution tells us about the confidence or credibility we assign to different choices. A cost function describes the penalty we incur when choosing an incorrect option. These concepts can be combined into an expected loss function.

Expected loss is defined as:

$$\mathbb{E}[\text{Loss} \mid \hat{x}] = \int L[\hat{x}, x]\, p(x \mid \tilde{x})\, dx$$

where $L[\hat{x}, x]$ is the loss function, $p(x \mid \tilde{x})$ is the posterior, and $\mathbb{E}[\text{Loss} \mid \hat{x}]$ is the expected loss. (When the posterior is discretized, the product inside the integral becomes an elementwise multiplication, i.e., a Hadamard product.)

- The posterior's mean minimizes the mean-squared error.
- The posterior's median minimizes the absolute error.
- The posterior's mode minimizes the zero-one loss.
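These three facts can be checked numerically on a toy discretized posterior (all names and numbers below are made up for illustration):

```python
import numpy as np

# Toy discrete posterior p(x | x~) over candidate values of x.
values = np.array([0.0, 1.0, 2.0, 10.0])
post = np.array([0.45, 0.1, 0.3, 0.15])  # probabilities, sum to 1

def expected_loss(x_hat, loss_fn):
    """E[Loss | x_hat] = sum_x L(x_hat, x) p(x | x~)."""
    return np.sum(loss_fn(x_hat, values) * post)

# Minimize the expected loss over a grid of candidate estimates x_hat.
grid = np.linspace(-1.0, 11.0, 2401)
best_sq = grid[np.argmin([expected_loss(g, lambda a, x: (a - x) ** 2) for g in grid])]
best_abs = grid[np.argmin([expected_loss(g, lambda a, x: np.abs(a - x)) for g in grid])]
best_zo = values[np.argmin([expected_loss(v, lambda a, x: (a != x).astype(float))
                            for v in values])]

print(best_sq)   # posterior mean: sum(values * post) = 2.2
print(best_abs)  # posterior median: 1.0
print(best_zo)   # posterior mode: 0.0
```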

Good Practice in minimizing loss function